
WIP: DecayRange #486

Draft

mjp41 wants to merge 2 commits into microsoft:main from mjp41:decayrange

Conversation

@mjp41
Member

@mjp41 mjp41 commented Mar 21, 2022

Implementation of a range that gradually releases memory back to
the OS. It pulls memory quickly, but dealloc_range caches the memory
locally and uses Pal timers to release it back to the next-level
range once sufficient time has passed.

  • Codify that the parent range needs to be concurrency safe.
  • Remove unused code.
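The decay mechanism described above can be sketched roughly as follows. This is a minimal illustration, not snmalloc's actual code: `Block`, `DecayCache`, `CountingParent`, and the tick-driven aging are all simplified stand-ins for the real range plumbing and Pal timer integration.

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Illustrative sketch: dealloc_range() caches blocks locally, and a
// periodic timer tick ages the cache, pushing sufficiently old blocks
// back to the parent range.
struct Block
{
  void* base;
  size_t size;
  unsigned age = 0; // number of timer ticks survived in the cache
};

// Trivial stand-in parent for demonstration purposes.
struct CountingParent
{
  size_t released = 0;
  void* alloc_range(size_t size) { return ::operator new(size); }
  void dealloc_range(void* base, size_t)
  {
    ::operator delete(base);
    released++;
  }
};

template<typename Parent, unsigned DecayTicks = 2>
class DecayCache
{
  Parent& parent;
  std::vector<Block> cache;

public:
  explicit DecayCache(Parent& p) : parent(p) {}

  // Prefer the local cache; fall back to the parent range.
  std::pair<void*, size_t> alloc_range(size_t size)
  {
    for (size_t i = 0; i < cache.size(); i++)
    {
      if (cache[i].size == size)
      {
        Block b = cache[i];
        cache.erase(cache.begin() + i);
        return {b.base, b.size};
      }
    }
    return {parent.alloc_range(size), size};
  }

  // Cache locally instead of returning to the parent immediately.
  void dealloc_range(void* base, size_t size)
  {
    cache.push_back({base, size, 0});
  }

  // Called from a Pal-style timer: age entries, release old ones.
  void handle_decay_tick()
  {
    std::vector<Block> keep;
    for (Block& b : cache)
    {
      if (++b.age >= DecayTicks)
        parent.dealloc_range(b.base, b.size);
      else
        keep.push_back(b);
    }
    cache = std::move(keep);
  }

  size_t cached() const { return cache.size(); }
};
```

The key property is that a freed range survives at least `DecayTicks` timer intervals in the local cache, during which it can be reused without touching the parent.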

@mjp41 mjp41 force-pushed the decayrange branch 2 times, most recently from 5ad1005 to f7e897a on March 21, 2022 at 21:31

namespace snmalloc
{
template<typename Rep>
Contributor
Given the reuse of the large buddy range rep here, at least a comment (or a concept) might be in order.

}

// We have run out of memory.
handle_decay_tick(); // Try to free some memory.
Contributor

Does this need to be interlocked against the timer firing? I suppose not, due to the prepend-only nature of all_local, the read-only nature of the spine traversal, and the use of pop_all for each found sizeclass... assuming that the parent range doesn't need interlocking, which, by default anyway, it doesn't (specifically, the default parent is a CommitRange whose parent is a GlobalRange; CommitRange has no state of its own, and GlobalRange is an interlock).

The presumption that the parent is concurrency-safe might merit being written down somewhere?

Member Author

@mjp41 mjp41 commented Mar 22, 2022

Yeah, I plan to add a static constexpr to all the types for the concurrency-safety property, like currently happens with Align; I just hadn't threaded it through yet. So GlobalRange would be true, CommitRange would be whatever its parent says, and the buddy would be false.
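That plan can be sketched as follows. The names mirror the types mentioned above, but the structure is hypothetical, just to illustrate the constexpr-propagation idea; the real range types carry far more machinery.

```cpp
// Sketch of threading a ConcurrencySafe property through range types,
// analogous to how Align is propagated (structure is illustrative).

// Hypothetical leaf with unsynchronised state, for illustration.
struct UnsafeBase
{
  static constexpr bool ConcurrencySafe = false;
};

template<typename ParentRange>
struct CommitRange
{
  // Stateless: safe exactly when the parent is safe.
  static constexpr bool ConcurrencySafe = ParentRange::ConcurrencySafe;
};

template<typename ParentRange>
struct GlobalRange
{
  // Wraps the parent in a lock, so always safe.
  static constexpr bool ConcurrencySafe = true;
};

template<typename ParentRange>
struct LargeBuddyRange
{
  // Unsynchronised local state: never safe on its own.
  static constexpr bool ConcurrencySafe = false;
};

// A consumer that requires a concurrency-safe parent can then
// enforce the property at compile time:
template<typename ParentRange>
struct DecayRange
{
  static_assert(
    ParentRange::ConcurrencySafe,
    "DecayRange requires a concurrency-safe parent range");
};
```

With this in place, composing a DecayRange over a non-safe parent becomes a compile-time error rather than an undocumented assumption.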

@mjp41 mjp41 force-pushed the decayrange branch 6 times, most recently from f6254d6 to 93d6e3f on March 23, 2022 at 16:18
@mjp41
Member Author

mjp41 commented Mar 23, 2022

So the perf of this is okay, but it increases memory footprint for some examples too much. I have factored out the primary changes to enable this into #491, so that can be landed, and the perf of this can be fixed and landed at a later point.

@mjp41 mjp41 force-pushed the decayrange branch 4 times, most recently from 37a1ce8 to 9694c96 on March 23, 2022 at 20:57
@mjp41
Member Author

mjp41 commented May 13, 2022

This paper has a really interesting approach to work stealing of chunks between threads:
https://dl.acm.org/doi/10.1145/3533724

I think we could use some of the ideas in this paper to make the decay range perform better.

@SchrodingerZhu
Collaborator

BTW, I have recently worked on a Weak AVL Tree:

llvm/llvm-project#172411

which behaves in between an AVL tree and a red-black tree, adapting based on the insertion/deletion rate. If data structure performance is a concern, weak AVL may be worth a try.

Do we require pointer stability for the nodes? If not, a B-tree is almost always faster.

@mjp41
Member Author

mjp41 commented Feb 24, 2026

BTW, I have recently worked on a Weak AVL Tree:

llvm/llvm-project#172411

which behaves in between an AVL tree and a red-black tree, adapting based on the insertion/deletion rate. If data structure performance is a concern, weak AVL may be worth a try.

Do we require pointer stability for the nodes? If not, a B-tree is almost always faster.

Oh, that is really interesting. We have a lot of constraints on the red-black tree code, as it uses the pagemap as the storage for the nodes. This means a node can only use 16 bytes, and about four bits of that are already reserved.
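For illustration, that 16-byte constraint looks something like the following. The field names, bit positions, and metadata layout here are hypothetical; the actual snmalloc node layout differs.

```cpp
#include <cstdint>

// Illustrative sketch of a tree node that must fit in 16 bytes
// (two 64-bit words), where a few low bits of each word are reserved
// for metadata such as a red/black bit or a WAVL parity bit.
struct CompactNode
{
  uint64_t left;  // child address | low metadata bits
  uint64_t right; // child address | low metadata bits

  static constexpr uint64_t META_MASK = 0x3; // 2 reserved bits per word

  uint64_t left_child() const { return left & ~META_MASK; }
  uint64_t right_child() const { return right & ~META_MASK; }

  // Pack one balance bit (colour or rank parity) into the left word.
  void set_balance_bit(bool b)
  {
    left = (left & ~uint64_t{1}) | (b ? 1 : 0);
  }

  bool balance_bit() const { return (left & 1) != 0; }
};

static_assert(
  sizeof(CompactNode) == 16, "node must fit in the pagemap entry");
```

Because child addresses are sufficiently aligned, the low bits are always zero in a valid pointer, which is what makes this kind of in-pointer metadata packing possible.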

@SchrodingerZhu
Collaborator

Then WAVL should be a drop-in solution. There are two variants: one uses one bit to store rank parity, and the other uses two bits to store rank-difference flags. The second is a little faster, perhaps because it does not need to access a bit in the children to recover the two-bit information.
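The difference between the two encodings can be sketched like this. The code is illustrative only; the parity-to-difference recovery rule relies on the standard WAVL invariant that the rank difference between a node and each child is always 1 or 2.

```cpp
// 1-bit variant: each node stores only rank parity (rank mod 2).
// Since a child's rank differs from its parent's by 1 or 2:
//   same parity      => rank difference 2 (even)
//   different parity => rank difference 1 (odd)
// Recovering the difference therefore requires reading the child's bit.
inline int rank_diff_from_parity(bool parent_parity, bool child_parity)
{
  return parent_parity == child_parity ? 2 : 1;
}

// 2-bit variant: store both child rank-difference flags directly in the
// parent, so no child access is needed to recover the information.
struct RankDiffFlags
{
  unsigned char bits; // bit 0: left diff is 2; bit 1: right diff is 2

  int left_diff() const { return (bits & 1) ? 2 : 1; }
  int right_diff() const { return (bits & 2) ? 2 : 1; }
};
```

This also shows why the 1-bit variant fits the 16-byte pagemap node more comfortably: it consumes only one of the few reserved bits, at the cost of an extra child access during rebalancing.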

@mjp41
Member Author

mjp41 commented Feb 24, 2026

Then WAVL should be a drop-in solution. There are two variants: one uses one bit to store rank parity, and the other uses two bits to store rank-difference flags. The second is a little faster, perhaps because it does not need to access a bit in the children to recover the two-bit information.

This sounds really interesting. Do you have time to experiment with this for snmalloc? If not, would you be happy to submit an issue so we don't lose the idea?

@SchrodingerZhu
Collaborator

SchrodingerZhu commented Feb 25, 2026

This sounds really interesting. Do you have time to experiment with this for snmalloc? If not, would you be happy to submit an issue so we don't lose the idea?

I can have a try; at the least, we can have an AI agent port it to the bench first. Do you have specific instructions to replay a workload where rbtree consolidation is considered important?

@mjp41
Member Author

mjp41 commented Feb 25, 2026

This sounds really interesting. Do you have time to experiment with this for snmalloc? If not, would you be happy to submit an issue so we don't lose the idea?

I can have a try; at the least, we can have an AI agent port it to the bench first. Do you have specific instructions to replay a workload where rbtree consolidation is considered important?

This commit exercises the rbtree pretty heavily:

bf7a152

@SchrodingerZhu
Collaborator

SchrodingerZhu commented Feb 25, 2026

According to a preliminary test (with 4x the iterations of the original test file):

RBTree Replacement Benchmark Report (2026-02-25)

Scope

  • Tree backend variants:
    • 0: Red-Black tree (baseline)
    • 1: WAVL 2-bit diff
    • 2: WAVL 1-bit parity
  • Large alloc benchmark workload increased from 100000 to 400000 iterations (x4).
  • Repetitions increased to 10 runs per variant for statistics (mean/stddev/min/max).
  • Hyperfine also run with 10 repetitions.

Package + Remote Run

  • Archive: /tmp/snmalloc-rbtree-variants-20260225.zip
  • Uploaded to: spark:/tmp/snmalloc-rbtree-variants-20260225.zip
  • Remote workdir: /tmp/snmalloc-rbtree-variants-20260225

Environments

Local 10-Run Metric Stats (ns)

variant metric n mean_ns stddev_ns min_ns max_ns delta_vs_rb_ns delta_vs_rb_pct
rb alloc_dealloc 10 34553400.70 1310666.08 32439904 37218008 0.00 +0.00%
w2 alloc_dealloc 10 32305511.90 1422450.42 30445550 34602621 -2247888.80 -6.51%
w1 alloc_dealloc 10 30988844.30 1278594.28 28944532 33868143 -3564556.40 -10.32%
rb batch_alloc_dealloc 10 88992666.30 3468105.48 84547398 95170524 0.00 +0.00%
w2 batch_alloc_dealloc 10 68550446.70 2775490.79 64034170 71758820 -20442219.60 -22.97%
w1 batch_alloc_dealloc 10 69224794.00 1657977.26 66713419 71818604 -19767872.30 -22.21%
rb alloc_touch_dealloc 10 37034922.30 1461485.12 35165642 39571968 0.00 +0.00%
w2 alloc_touch_dealloc 10 33708889.80 1158505.79 31419122 35025455 -3326032.50 -8.98%
w1 alloc_touch_dealloc 10 32279160.80 1207555.98 30970969 34691720 -4755761.50 -12.84%

Local Hyperfine (10 runs)

Command Mean [ms] Min [ms] Max [ms] Relative
./build-rb/perf-large_alloc-fast 162.5 ± 4.8 157.3 169.7 1.03 ± 0.08
./build-w2/perf-large_alloc-fast 157.2 ± 10.9 148.8 182.5 1.00
./build-w1/perf-large_alloc-fast 176.3 ± 12.7 157.7 191.0 1.12 ± 0.11

Spark (aarch64) 10-Run Metric Stats (ns)

variant metric n mean_ns stddev_ns min_ns max_ns delta_vs_rb_ns delta_vs_rb_pct
rb alloc_dealloc 10 36912757.90 8786222.89 32479343 54836176 0.00 +0.00%
w2 alloc_dealloc 10 46133636.00 9692479.56 32085710 57150088 9220878.10 +24.98%
w1 alloc_dealloc 10 39381360.00 10398906.56 30973994 55649475 2468602.10 +6.69%
rb batch_alloc_dealloc 10 120928117.90 4419886.10 115008482 125408191 0.00 +0.00%
w2 batch_alloc_dealloc 10 88191151.10 204937.33 87790801 88468595 -32736966.80 -27.07%
w1 batch_alloc_dealloc 10 87219709.70 315244.24 86645261 87587249 -33708408.20 -27.87%
rb alloc_touch_dealloc 10 32501404.10 100263.29 32212927 32582160 0.00 +0.00%
w2 alloc_touch_dealloc 10 32077673.10 152537.46 31795581 32269678 -423731.00 -1.30%
w1 alloc_touch_dealloc 10 31466960.90 102171.68 31324316 31684637 -1034443.20 -3.18%

Spark Hyperfine (10 runs)

Command Mean [ms] Min [ms] Max [ms] Relative
./build-rb/perf-large_alloc-fast 192.3 ± 9.3 180.9 206.2 1.24 ± 0.09
./build-w2/perf-large_alloc-fast 156.5 ± 6.3 152.3 168.6 1.01 ± 0.07
./build-w1/perf-large_alloc-fast 155.2 ± 8.0 149.6 169.4 1.00

Notes

  • In-program metric timers (ns lines from perf-large_alloc-fast) and hyperfine wall-time can rank variants differently.
  • This report is post-fix and supersedes earlier numbers from intermediate iterations.

@SchrodingerZhu
Collaborator

While the data structure should be correctly implemented, the Codex-generated code appears very ad hoc, so I may need to craft this change by hand. Given that the 1-bit rank parity approach seems to be the most promising solution, I will retain just that single implementation.

3 participants